Jonic Mecija
Sentiment analysis performed on Kaggle's "Amazon Fine Food Reviews" dataset
import pandas as pd
df = pd.read_csv('Reviews.csv')
df.head()
|   | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px
# Reviewing the product scores to see if customer ratings are positive or negative
fig = px.histogram(df, x="Score")
fig.update_traces(marker_color="purple", marker_line_color='rgb(8,48,107)', marker_line_width=1.5)
fig.update_layout(title_text='Product Score')
fig.show()
# assign reviews with score > 3 as positive sentiment
# score < 3 negative sentiment
# remove score = 3
df = df[df['Score'] != 3]
df['sentiment'] = df['Score'].apply(lambda rating: 1 if rating > 3 else -1)
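The same labeling can be done without `apply`; a minimal sketch using `np.where` on a toy stand-in for `df['Score']`:

```python
import numpy as np
import pandas as pd

# toy scores standing in for df['Score']
scores = pd.Series([5, 1, 4, 2, 5])

# vectorized equivalent of the apply/lambda above
sentiment = np.where(scores > 3, 1, -1).tolist()
print(sentiment)  # [1, -1, 1, -1, 1]
```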
# split the data frame into positive and negative
positive = df[df['sentiment'] == 1]
negative = df[df['sentiment'] == -1]
# the distribution of sentiment labels shows that reviews are mostly positive
df['sentiment_label'] = df['sentiment'].replace({-1: 'negative', 1: 'positive'})
fig = px.histogram(df, x="sentiment_label")
fig.update_traces(marker_color="indianred",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Product Sentiment')
fig.show()
We will train a logistic regression model to predict whether new reviews are positive or negative.
# clean the data of punctuation
def remove_punctuation(text):
    return "".join(u for u in text if u not in ("?", ".", ";", ":", "!", '"'))
df['Text'] = df['Text'].apply(remove_punctuation)
df = df.dropna(subset=['Summary'])
df['Summary'] = df['Summary'].apply(remove_punctuation)
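As a quick sanity check, the cleaner drops quotes and terminal punctuation while leaving the words intact:

```python
def remove_punctuation(text):
    # same cleaning rule as above: drop ? . ; : ! and double quotes
    return "".join(u for u in text if u not in ("?", ".", ";", ":", "!", '"'))

print(remove_punctuation('"Delight" says it all!'))  # Delight says it all
```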
dfNew = df[['Summary','sentiment']]
dfNew.head()
|   | Summary | sentiment |
|---|---|---|
| 0 | Good Quality Dog Food | 1 |
| 1 | Not as Advertised | -1 |
| 2 | Delight says it all | 1 |
| 3 | Cough Medicine | -1 |
| 4 | Great taffy | 1 |
import numpy as np
# random split into train and test sets
# note: np.random.randn draws from a standard normal, so the cutoff at 0.8
# keeps roughly 79% of rows for training (P(Z <= 0.8) ≈ 0.79), not exactly 80%
index = df.index
df['random_number'] = np.random.randn(len(index))
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]
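For an exact, reproducible 80/20 split, scikit-learn's `train_test_split` is the more conventional choice; a sketch on a tiny stand-in frame with this notebook's column names:

```python
from sklearn.model_selection import train_test_split
import pandas as pd

df_demo = pd.DataFrame({
    'Summary': ['Great taffy', 'Not as Advertised', 'Good Quality Dog Food',
                'Cough Medicine', 'Delight says it all'],
    'sentiment': [1, -1, 1, -1, 1],
})

# test_size=0.2 reserves 20% of rows; random_state makes the split repeatable
train_demo, test_demo = train_test_split(df_demo, test_size=0.2, random_state=42)
print(len(train_demo), len(test_demo))  # 4 1
```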
# count vectorizer:
from sklearn.feature_extraction.text import CountVectorizer
# Convert the collection of summaries to a matrix of token counts;
# the custom token_pattern keeps single-character words, which the
# default pattern (r'(?u)\b\w\w+\b') would drop
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train['Summary'])
test_matrix = vectorizer.transform(test['Summary'])
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
X_train = train_matrix
X_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']
lr.fit(X_train,y_train)
C:\Users\Jonic\anaconda3\envs\GPU\lib\site-packages\sklearn\linear_model\_logistic.py:765: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
LogisticRegression()
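The warning above can be silenced by following its own suggestion and raising `max_iter` from the default of 100. A sketch on a tiny stand-in corpus (the notebook fits on `train['Summary']` instead):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# tiny invented corpus standing in for the review summaries
texts = ['good quality', 'not as advertised', 'great taffy', 'cough medicine']
labels = [1, -1, 1, -1]
X = CountVectorizer(token_pattern=r'\b\w+\b').fit_transform(texts)

# raising max_iter gives the lbfgs solver room to converge
lr = LogisticRegression(max_iter=1000)
lr.fit(X, labels)
```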
predictions = lr.predict(X_test)
# find accuracy, precision, recall:
from sklearn.metrics import confusion_matrix,classification_report
confusion_matrix(predictions,y_test)
# note: the conventional call is confusion_matrix(y_true, y_pred); with the
# arguments swapped here, rows are the predicted labels and columns the
# actual labels:
#                  actual -1        actual +1
#                 ______________________________
# predicted -1 |  true negative   false negative
# predicted +1 |  false positive  true positive
array([[11467, 2285],
[ 5934, 91845]], dtype=int64)
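The report's headline accuracy can be recomputed by hand from this matrix: correct predictions sit on the diagonal, so accuracy is the diagonal sum over the total.

```python
# values from the printed confusion matrix, with rows = predicted labels
# and columns = actual labels (because the arguments were swapped)
m = [[11467, 2285],
     [5934, 91845]]

total = sum(sum(row) for row in m)       # 111531 test reviews
accuracy = (m[0][0] + m[1][1]) / total   # correct predictions on the diagonal
print(round(accuracy, 2))  # 0.93
```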
print(classification_report(predictions,y_test))
# -1: negative reviews
#  1: positive reviews
# precision: tp / (tp + fp)
# recall: tp / (tp + fn)
# f1-score: 2 * (precision * recall) / (precision + recall)
# support: the number of occurrences of each class in y_true
# macro avg: unweighted mean of the per-label metrics
#            (does not take label imbalance into account)
# weighted avg: mean of the per-label metrics weighted by support; this
#               accounts for label imbalance and can give an F-score that
#               is not between precision and recall
# note: the conventional call is classification_report(y_true, y_pred);
# because the arguments are swapped here, the precision and recall columns
# are interchanged relative to their standard definitions
precision recall f1-score support
-1 0.66 0.83 0.74 13752
1 0.98 0.94 0.96 97779
accuracy 0.93 111531
macro avg 0.82 0.89 0.85 111531
weighted avg 0.94 0.93 0.93 111531
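Once fitted, the vectorizer and model can score unseen reviews end to end. A self-contained sketch with an invented stand-in corpus replacing `train['Summary']`; the method calls mirror the notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# invented mini training set; 1 = positive sentiment, -1 = negative
train_texts = ['great taffy', 'good quality dog food', 'delight says it all',
               'not as advertised', 'cough medicine', 'arrived stale']
train_labels = [1, 1, 1, -1, -1, -1]

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
X = vectorizer.fit_transform(train_texts)
lr = LogisticRegression(max_iter=1000).fit(X, train_labels)

# score fresh summaries: transform with the SAME fitted vocabulary
new_reviews = ['great quality food', 'stale and not as advertised']
preds = lr.predict(vectorizer.transform(new_reviews))
print(preds)  # [ 1 -1]
```

Reusing the fitted vectorizer for `transform` is the key step: a new `fit_transform` on the incoming text would build a different vocabulary and break the column alignment the model was trained on.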